A key feature of many new computer architectures is that they are composed of multiple tiles, each of which is
a fully capable processor. Tiled architectures are attractive alternatives to monolithic computer architecture designs
because they allow a larger design to be built from smaller modules and limit the number of wires that need to span
the entire chip. Examples of tiled architectures include many under development for the DARPA/IPTO Polymorphous
Computer Architectures (PCA) program, including the MIT Raw machine [13], the Stanford Smart Memories
project [8], and the University of Texas TRIPS machine [9].

The decoupled systolic architecture (DSA) represents a canonical abstract machine that encompasses many of
the key features of single-chip tiled architectures [4], including the PCAs and other emerging architectures such as
Scale [6], WaveScalar [12], and Synchroscalar [10]. Stream algorithms are defined as the family of algorithms that
can achieve 100% computational efficiency on the DSA. The DSA and stream algorithms provide a rigorous analytical
framework for reasoning about the performance of algorithms on modern architectures. This framework is unique in
that it explicitly penalizes algorithmic implementations that make use of long wires and/or large local memories while
rewarding those algorithms that can execute efficiently using only a small, bounded amount of local memory and a
next-neighbor interconnect network. Thus, this framework is an excellent match for evaluating architectures faced
with the growing physical concerns of wire delay [3] and the energy dissipation of on-chip memory [2, 5].

Stream algorithms are therefore important because the existence of a stream algorithm for a particular problem
implies a scalable solution to that problem that is efficient in computation, energy, and area on many real-world
architectures. Stream algorithms decouple memory access from computation, performing memory access on tiles on the
periphery of the chip and performing computation in a systolic fashion on the tiles in the interior of the chip. For a
problem of size N on an R x R array of tiles, the efficiency E(N, R) is the total number of operations C(N) divided by
the product of the number of cycles T(N, R) and the total number of memory tiles M(R) and compute tiles P(R),

$$E(N,R) = \frac{C(N)}{T(N,R)\,\bigl(M(R) + P(R)\bigr)} \qquad (1)$$

For a conventional architecture, the total number of tiles is equal to 1. A necessary condition for E(N, R) to scale with
the size of the array is that M(R) be asymptotically smaller than P(R). An algorithm that meets the requirement that
M(R) = o(P(R)) is decoupling efficient, because it efficiently decouples memory accesses from computation [11].
An algorithm is computation efficient if $\lim_{\sigma,R \to \infty} E(\sigma,R) = 1$, where $\sigma = N/R$. Computation efficient algorithms
implemented on an array of fixed size scale toward an asymptotic limit on performance as data size increases, and this
asymptotic limit becomes larger as the array size R increases. Stream algorithms are therefore those algorithms that
meet the computation efficiency condition. Stream algorithms for matrix multiplication, QR factorization, convolution,
and other applications have been discovered and implemented on the Raw cycle-accurate simulator [4]. Comparison of
these algorithms with conventional implementations on conventional architectures such as the PowerPC G4 [7] shows
that stream algorithms have the potential to achieve higher efficiency on many different problems.
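
To make the decoupling condition concrete, the following Python sketch evaluates equation (1) and the share of tiles devoted to memory. The tile counts M(R) = 2R and P(R) = R^2 are assumptions borrowed from the matrix multiply layout described later in this document, and the function names are illustrative only.

```python
def efficiency(ops, cycles, memory_tiles, compute_tiles):
    """Equation (1): operations / (cycles * total number of tiles)."""
    return ops / (cycles * (memory_tiles + compute_tiles))


def tile_counts(R):
    """Assumed layout (not fixed by the text): 2R memory tiles, R*R compute tiles."""
    return 2 * R, R * R


def memory_fraction(R):
    """Share of all tiles used for memory; tends to 0 when M(R) = o(P(R))."""
    m, p = tile_counts(R)
    return m / (m + p)
```

Under these assumptions one third of the tiles are memory tiles at R = 4, and the share shrinks as R grows, which is what allows the overall efficiency to approach 1.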

This presentation focuses on understanding when a stream algorithm exists for a given kernel. We do so by considering
the directed acyclic graph (DAG) for a particular implementation of the kernel. Nodes in the DAG represent
inputs, outputs, or intermediate products of the algorithm, and an edge from node A to node B in the DAG shows that
A is used to compute B. We can characterize the DAG for an algorithm by the ratio of the number of inputs, W, to the
number of intermediate products, Q, for which any one value is directly required. For example, in an algorithm to
multiply two N x N matrices A and B, element i,j of the output matrix C is computed as
$c_{ij} = \sum_{k=1}^{N} a_{ik} b_{kj}$. That is, for each output, there are W = 2N inputs used and a total of Q = N
intermediate products (the partial sums) computed. The stream algorithm implementation of matrix multiply meets the
compute efficiency condition. Matrix multiplication is an example of a kernel with a constant ratio of W to Q. All
known algorithms with a constant ratio of W to Q, including QR, SVD, and convolution, have implementations that
meet the compute efficiency condition.
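
As an illustration of how W and Q can be read off a DAG, the sketch below enumerates the products of a naive N x N matrix multiply and counts, for each output, how many distinct inputs it needs (W) and, for each input, how many products it feeds directly (Q). The function name and the tuple labels for DAG nodes are hypothetical, chosen only for this example.

```python
from collections import Counter


def matmul_dag_counts(N):
    """Return (W, Q) for a naive N x N matrix multiply.

    W: number of distinct inputs needed by one output c_ij.
    Q: number of products a_ik * b_kj that a single input feeds directly.
    """
    uses = Counter()
    inputs_per_output = {}
    for i in range(N):
        for j in range(N):
            needed = set()
            for k in range(N):
                a, b = ("a", i, k), ("b", k, j)
                uses[a] += 1  # a_ik is consumed by the product a_ik * b_kj
                uses[b] += 1  # b_kj is consumed by the same product
                needed.update((a, b))
            inputs_per_output[(i, j)] = len(needed)
    return max(inputs_per_output.values()), max(uses.values())
```

For any N this gives W = 2N and Q = N, so the ratio W/Q stays at 2 regardless of problem size.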

In contrast, consider an algorithm to compute the FFT of a length-N vector. To compute any particular output of the
FFT, all N inputs are required, and (as is well known) each input directly contributes to log2(N) intermediate products.

This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract F19628-00-C-0002.
Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed
by the United States Government.
[Figure: FFT Performance on Various Architectures]

Figure 1: Comparison of Raw FFT throughput, measured in the Raw cycle-accurate simulator, to FFTW throughput
on the PowerPC G4 and Xeon.


For the radix-2 FFT algorithm, the ratio of W = O(N) to Q = O(log2(N)) is asymptotically greater than a constant.
Because any stream algorithm for the FFT must meet the decoupling efficiency condition, we cannot use local memory
to buffer the large number of inputs. Instead, inputs must reside in the network while compute tiles are working. For the
FFT, with W/Q = O(N)/O(log2(N)), this implies that the maximum distance that any piece of data must travel is
greater than the number of intermediate calculations in which the data is used. Therefore, communication costs cannot
be effectively amortized in the systolic implementation on a tiled architecture. The factor T in the denominator of the
efficiency expression (1) will have a lower bound that is limited by the size of the array, meaning that the efficiency
cannot approach a limit of 1 as the array size R increases. A stream algorithm implementation for the FFT is still
an open research problem. Stream algorithm techniques can be used to build an efficient implementation of the
radix-4 FFT for a 4x4 tile array, but this implementation is not scalable and performance will be worse on larger Raw
systems. Simulated throughput of this algorithm is compared to the throughput of FFTW [1] on the 2.8 GHz Pentium
4 and 733 MHz G4 in Figure 1. The Raw FFT achieves high performance for large data sizes, and offers performance
that is more stable across a range of data sizes.

In  this  talk,  we  will  describe  the  implementation  of  FFT,  QR  factorization,  and  CFAR  kernels  on  the  Raw  simulator 
and  Raw  board.  We  examine  the  performance  of  these  kernels  and  compare  to  conventional  implementations  on  the 
Pentium  and  G4  architectures.  Finally,  we  characterize  the  DAG  of  each  kernel  and  discuss  how  the  DAG  influences 
the  implementation  on  Raw  and  on  tiled  architectures  in  general. 
Tiled Architectures


*  Monolithic  single-chip  architectures  are  becoming  rare  in 
the  industry 

-  Designs  become  increasingly  complex 

-  Long  wires  cannot  propagate  across  the  chip  in  one  clock 

*  Tiled  architectures  offer  an  attractive  alternative 

-  Multiple  simple  tiles  (or  “cores”)  on  a  single  chip 

-  Simple  interconnection  network  (short  wires) 

*  Examples  exist  in  both  industry  and  research 

-  IBM  Power4  &  Sun  Ultrasparc  IV  each  have  two  cores 

-  AMD,  Intel  expected  to  introduce  dual-core  chips  in  mid-2005 

-  DARPA  Polymorphous  Computer  Architecture  (PCA)  program 
PCA Block Diagrams


[Figure: block diagrams of TRIPS (University of Texas), RAW (MIT), and Smart Memories (Stanford)]


•  All  of  these  are  examples  of  tiled 
architectures 

•  In  particular,  RAW  is  a  4x4  array  of  tiles 

-  Small  amount  of  memory  per  tile 

-  Scalar  operand  network  allows  delivery 
of  operands  between  functional  units 

-  Plans  for  a  1024-tile  RAW  fabric 

•  This  research  aims  to  develop 
programming  methods  for  large  tile 
arrays 


HPEC  2004-4 
JML  28  Sep  2004 


MIT  Lincoln  Laboratory 


Outline 


*  Introduction 

*  Stream  Algorithms  and  Tiled  Architectures 

*  Mapping  Signal  Processing  Kernels  to  RAW 

*  Conclusions 
Stream Algorithms for Tiled Architectures


Decoupled Systolic Architecture

[Figure: time/space diagram of the tile array]

•  M(R) edge tiles are allocated to memory management

•  P(R) inner tiles perform computation systolically using registers and static network


Stream Algorithm Efficiency:

$$E(N,R) = \frac{C(N)}{T(N,R)\,\bigl(P(R) + M(R)\bigr)}$$

where
N = problem size
R = edge length of tile array
C(N) = number of operations
T(N,R) = number of time steps
P(R) + M(R) = total number of tiles

Compute Efficiency Condition:

$$\lim_{\sigma,R \to \infty} E(\sigma,R) = 1, \quad \text{where } \sigma = N/R$$


Stream  algorithms  achieve  high  efficiency  by: 
-Partitioning  the  problem  into  sub-problems 
-Decoupling  memory  access  from  computation 
-Hiding  communication  latency 




Example  Stream  Algorithm 
Matrix  Multiply 


•  Calculate C = A B

-  Partition A into N/R row blocks and B into N/R column blocks

-  N = problem size, R = edge length of tile array

•  In each phase, compute R^2 elements of C

-  Involves 2N operations per tile

-  N^2/R^2 phases

•  Computations can be pipelined

-  Cost is 2R cycles to start and drain the pipeline

-  R cycles to output the result


[Figure: C = A B mapped to the tile array, with memory tiles labeled]


Efficiency Calculation:

$$E(N,R) = \frac{2N^3}{\bigl(2N \cdot N^2/R^2 + 3R\bigr)\bigl(R^2 + 2R\bigr)}$$

$$E(\sigma,R) = \frac{2\sigma^3}{2\sigma^3 + 3} \cdot \frac{R}{R + 2} \quad \text{for } \sigma = N/R$$

$$\lim_{\sigma,R \to \infty} E(\sigma,R) = 1$$


Achieves high efficiency as data size (N) and array size (R) grow
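
The efficiency calculation above can be checked numerically. The sketch below is only an illustration: it evaluates the cycle count 2N * N^2/R^2 + 3R and tile count R^2 + 2R from this slide, and the equivalent form in sigma = N/R; the function names are hypothetical.

```python
def matmul_efficiency(N, R):
    """E(N, R) for the stream matrix multiply, from the slide's cost model."""
    ops = 2 * N**3                          # C(N)
    cycles = 2 * N * (N**2 / R**2) + 3 * R  # N^2/R^2 phases of 2N cycles, plus 3R overhead
    tiles = R**2 + 2 * R                    # P(R) + M(R)
    return ops / (cycles * tiles)


def matmul_efficiency_sigma(sigma, R):
    """Same quantity written in terms of sigma = N/R."""
    return (2 * sigma**3 / (2 * sigma**3 + 3)) * (R / (R + 2))
```

For R = 4 the second factor caps the efficiency at 4/6, so large N on a 4x4 array approaches but never exceeds about 66%.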
Matrix Multiply Efficiency


[Figure: "Matrix multiplication": efficiency versus log(N/R) for the ideal DSA and the RAW simulator. Assumes a 4x4
decoupled systolic architecture or RAW surrounded by memory tiles (max efficiency = 66%).]

[Figure: "Scaled overall efficiency" versus number of compute processors. Scaling the number of overall tiles devotes
a smaller percentage of tiles to memory, which leads to higher efficiency.]


•  Stream  algorithms  achieve  high  efficiency  on  large  tile  arrays 

•  We  need  to  identify  algorithms  that  can  be  recast  as  stream  algorithms 
Analyzing the Matrix Multiply


Consider the matrix multiply computation in more detail

To compute c_ij, row i of A is multiplied by column j of B

•  2N inputs required

•  2N operations required

[Figure: fragment of the DAG showing outputs c11, c12, c21, and c22]

•  Examine the directed acyclic graph (DAG) for the matrix multiply

•  For each output produced

-  There are W inputs required (O(N))

-  The input i is used Q_i times (O(N))

-  These are intermediate products

•  The matrix multiply is an example of an algorithm with a constant ratio of input
data (W) to intermediate products (Q)


A constant W/Q implies a degree of scale-invariance:

•  Communication and computation maintain the same ratio as N increases

•  Therefore the implementation can efficiently use more tiles on large problems

RAW Test Board


•  Write  kernels  to  run  on  prototype  RAW  board 

-  4x4  RAW  chip,  100  MHz 

•  MIT software includes cycle-accurate simulator

-  Code  written  for  the  simulator  easily  runs  on  board 

-  Initial  tests  show  good  agreement  between  simulator  and  board 
•  Expansion connector allows direct access to RAW static network

-  Firmware  re-programming  required 

-  External  FPGA  board  streams  data  into  and  out  of  RAW 

-  Design  streams  data  into  ports  on  corner  tiles 

-  Interface  is  not  yet  complete  so  present  results  are  from  simulator 


Typical  RAW  configuration  for  a 
stream  algorithm  on  prototype  board: 

I/O  tiles 

•  Stream  data  to  and  from  outside  world 

Memory  tiles 

•  Store  intermediate  values 

•  Stream  data  to  and  from  computation  tiles 

Computation  tiles 

•  Perform  computation  systolically 

•  Use  static  network  and  registers 
QR Factorization Mapping


Algorithm to compute A=QR:

For each block of columns
    compute Givens rotations
    apply Givens rotations to A


[Figure: for a matrix A with six columns, data flow for one column block during rotation computation and during
rotation application]


•  I/O  tiles  are  only  used  at  start  and  end  of  process 

-  In-between,  data  is  stored  in  memory  tiles 

•  This  shows  the  flow  for  odd-numbered  column  blocks 

-  For  even-numbered  blocks  of  columns,  data  flows  from  bottom  memory 
tiles  to  the  top  of  the  array 
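
The two steps of the algorithm above, computing a Givens rotation and applying it to A, can be sketched in plain Python. This is a minimal illustration under stated assumptions: it is real-valued (the slides report complex QR), works one column at a time rather than on column blocks, and says nothing about how the work is laid out on tiles. The function names are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) such that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def qr_givens_r(A):
    """Reduce a copy of A (a list of rows) to the upper-triangular factor R."""
    R = [row[:] for row in A]
    m, n = len(R), len(R[0])
    for col in range(n):
        # Zero the entries below the diagonal, working from the bottom row up.
        for row in range(m - 1, col, -1):
            c, s = givens(R[row - 1][col], R[row][col])  # compute the rotation
            for j in range(col, n):                      # apply it to both rows
                upper, lower = R[row - 1][j], R[row][j]
                R[row - 1][j] = c * upper + s * lower
                R[row][j] = -s * upper + c * lower
    return R
```

Each rotation touches only two rows, which is what makes the computation a natural fit for data flowing between neighboring tiles.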
Complex QR Factorization Performance


[Figure: Simulated RAW efficiency for complex QR on square matrices]

[Figure: Projected asymptotic efficiency on scaled versions of RAW versus array dimension R, with curves labeled
P(R) and M(R)]


The QR factorization has a constant ratio of input data (W) to intermediate products (Q)


Projected matrix size N80 to achieve 80% efficiency on compute tiles P(R):

R     N80
4     64
8     128
16    256
32    512

The QR factorization efficiency scales to 100% as array and data size increase
Convolution (Time Domain) Mapping


[Figure: mapping of the input vector (length n), the filter (length k), and the result for Stream 1 onto the compute
tiles and the memory and I/O tiles]

•  Filter  coefficients  distributed  cyclically  to  tiles 

-  Each  compute  tile  convolves  the  input  with  a  subset  of  the  filter 

-  Assume  n  (data  length)  >  k  (filter  length) 

•  Each stream is a different convolution operation

-  In  multichannel  signal  processing  applications  we  rarely  perform 
just  one  convolution 

•  12  of  16  tiles  used  for  computation 

-  Maximum  75%  efficiency 
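
The cyclic distribution of filter coefficients can be illustrated with a short sketch. It assumes that each tile forms a partial convolution from its own taps and that the partial results are summed; how the real implementation accumulates results across tiles is not stated on the slide, and the function names are hypothetical.

```python
def convolve_direct(x, h):
    """Reference time-domain convolution of input x with filter h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def convolve_tiled(x, h, num_tiles):
    """Convolution with filter taps dealt cyclically to num_tiles tiles."""
    y = [0.0] * (len(x) + len(h) - 1)
    for tile in range(num_tiles):
        taps = range(tile, len(h), num_tiles)  # taps tile, tile + num_tiles, ...
        for i, xi in enumerate(x):
            for j in taps:
                y[i + j] += xi * h[j]          # this tile's partial result
    return y
```

Because every tap is assigned to exactly one tile, the summed partial results equal the direct convolution.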
Convolution Performance


[Figure: convolution run time in seconds]


•  Convolution achieves good performance in RAW simulator

•  Longer filters and input vectors are more efficient

•  Longer input vectors are also more easily mapped to more processors



CFAR  Mapping 


•  Constant False-Alarm Rate (CFAR) Detection

•  For each output:

-  There are W = O(N_cfar) inputs required

-  The input i is used Q_i = O(1) times


•  For a long stream, CFAR requires 7 ops/cell

•  Consider dividing up a stream over R tiles

-  7/R operations per tile

-  N communication steps per tile

-  Communication quickly dominates computation

•  Instead consider parallel processing of streams

•  Goal is to move data through the chip as fast as possible

•  This implementation does not scale with array size R

-  As R increases, there would be greater latency involved in using tiles in the center of the chip

•  Data flow:

-  Data cube is streamed into RAW using the static network

-  Corner input ports receive data

-  Each quadrant processes data from one port

-  One row of range data (“one stream”) is processed by a single tile

-  Results gathered to corner tile and output
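
The per-stream work done by a single tile can be sketched as follows. The slides do not say which CFAR variant is used, so this example assumes cell-averaging CFAR with guard cells and running window sums; the function name, parameters, and threshold rule are illustrative assumptions rather than the implementation measured here.

```python
def ca_cfar(power, num_ref, num_guard, scale):
    """Cell-averaging CFAR over one stream of power samples.

    Each tested cell is compared against `scale` times the mean of `num_ref`
    reference cells on each side, skipping `num_guard` guard cells.
    Cells without a full window on both sides are left as False.
    """
    n = len(power)
    half = num_ref + num_guard
    detections = [False] * n
    if n < 2 * half + 1:
        return detections
    lag = sum(power[0:num_ref])                           # window before cell `half`
    lead = sum(power[half + num_guard + 1:2 * half + 1])  # window after cell `half`
    for i in range(half, n - half):
        noise = (lag + lead) / (2 * num_ref)
        detections[i] = power[i] > scale * noise
        if i + 1 < n - half:
            # Slide both windows by one cell: add the entering sample, drop the leaving one.
            lag += power[i - num_guard] - power[i - half]
            lead += power[i + half + 1] - power[i + num_guard + 1]
    return detections
```

With running sums, each cell costs a small constant number of operations while every input is touched only a few times, which is the Q_i = O(1) behavior noted above.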
CFAR Performance


[Figure: CFAR efficiency (%), with regions marked "Stream fits in cache" and "Stream does not fit in cache"]


•  CFAR  achieves  an  efficiency  of  11-15% 

-  Efficiency  on  conventional  architectures  =  5-10%,  similarly  optimized 

-  RAW  implementation  benefits  from  large  off-chip  bandwidth 


•  Compute  tile  efficiency  does  not  scale  to  100%  as  for  Stream 
Algorithms  (matrix  multiply,  convolution,  QR) 
Data Flow for the FFT


Cooley-Tukey Radix-2 FFT:

For each of (log2 N) stages
    compute N/2 "butterflies"


Radix-2 butterfly:

[Figure: butterfly with inputs a and b and outputs a + ωb and a - ωb]

•  2 complex inputs

•  precomputed weight ω

•  10 real operations


•  For each output produced

-  There are W inputs required (O(N))

-  The input i is used Q_i times (O(log2 N))

-  These are intermediate computations


•  W/Q is O(N/log2 N)

-  As N increases, communication requirements grow faster than computation

-  Therefore we expect that the Radix-2 FFT cannot efficiently scale
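
The stage and butterfly structure described above can be seen in a small recursive sketch. It is a textbook decimation-in-time radix-2 FFT for power-of-two lengths, not the tiled implementation; it also counts butterflies so the (N/2) log2 N total can be checked. The function name is hypothetical.

```python
import cmath


def fft_radix2(x):
    """Radix-2 decimation-in-time FFT of a list whose length is a power of two.

    Returns (spectrum, butterflies), where butterflies = (N/2) * log2(N).
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])], 0
    even, count_even = fft_radix2(x[0::2])
    odd, count_odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)  # weight (twiddle factor)
        t = w * odd[k]                         # complex multiply: 6 real operations
        out[k] = even[k] + t                   # complex add: 2 real operations
        out[k + n // 2] = even[k] - t          # complex subtract: 2 real operations
    return out, count_even + count_odd + n // 2
```

At 10 real operations per butterfly the total work is 5 N log2 N real operations, while each output still depends on all N inputs.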
Mapping the Radix-2 FFT to a Tile Array


[Figure: data elements 0 to 7 assigned to tiles at successive stages of the radix-2 FFT]

•  For  each  butterfly: 

-  4  +  (R-1)  cycles  to  clock  inputs  across  the  array 

-  10/R  computations  per  tile 

-  When  R=2,  tiles  are  used  efficiently 

Can  overlap  computation  (5  cycles)  and  communication  (5  cycles) 

-  When  R>2,  cannot  use  tiles  efficiently 

Latency  to  clock  inputs  >  number  of  ops  per  tile 

•  For  each  stage: 

-  Pipeline  N/2  butterflies  on  R  rows  or  columns 

•  Overall  efficiency  limited  to  50% 

-  2x2  compute  tiles  +  4  memory  tiles 
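
The per-butterfly balance quoted above is simple enough to tabulate. The sketch below only restates the slide's two figures, 4 + (R-1) cycles to clock inputs across the array and 10/R operations per tile; the function name is hypothetical.

```python
def radix2_butterfly_balance(R):
    """Return (communication cycles, operations per tile) for one butterfly."""
    comm_cycles = 4 + (R - 1)  # cycles to clock the inputs across R tiles
    ops_per_tile = 10 / R      # 10 real operations shared by R tiles
    return comm_cycles, ops_per_tile
```

At R = 2 the two quantities match (5 and 5), so communication can be overlapped with computation; at R = 4 there are 7 communication cycles for only 2.5 operations per tile.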
Mapping the Radix-R FFT to a Tile Array


Idea:  use  a  Radix-R  FFT  algorithm  on  an  R  by  R  array 

•  A  Radix-R  FFT  algorithm 

-  Uses log_R(N) stages

-  Compute  N/R  Radix-R  butterflies  per  stage 

•  Implement  the  radix-R  butterfly  with  an  R-point  DFT 

-  W,  Q  both  scale  with  R  for  a  DFT 

-  Allows  us  to  use  more  processors  for  each  stage 

-  Still  becomes  inefficient  as  R  gets  “too  large” 

-  Efficiency  limit  for  radix-4  algorithm  =  56% 

-  Efficiency  limit  for  radix-8  algorithm  =  54% 

•  Radix-4 implementation:

-  Distribute  a  radix-4  butterfly  over  4  processors  in  a  row  or  column 

-  Perform  4  butterflies  in  parallel 

-  8  memory  tiles  required 
Radix-4 FFT Algorithm Performance


Simulated Radix-4 FFT on 4x4 RAW plus 8 memory tiles

[Figure: "FFT Throughput on Various Architectures" (Mflop/s) and "FFT Efficiency on Various Architectures"]


Example: Radix-4 FFT algorithm achieves high throughput on 4x4 RAW

-  Comparable efficiency to FFTW on G4, Xeon

-  Raw efficiency stays high for larger FFT sizes


G4, Xeon FFT results from http://www.fftw.org/benchfft
Classifying Kernels


Kernels may be classified by the ratio W/Q

•  Constant Ratio: W = O(N), Q_i = O(N)

-  e.g., Matrix Multiply, QR, Convolution

-  Stream algorithms: efficiency approaches 1 as R, N/R increase

•  Sub-Linear Ratio: W = O(N), Q_i < O(N)

-  e.g., FFT

-  Require trade-off between efficiency and scalability

•  Linear Ratio: W = O(N), Q_i = O(1)

-  e.g., CFAR

-  Difficult to find efficient or scalable implementation


Examining W/Q gives insight into whether a stream algorithm exists for the kernel
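
The three classes can be told apart numerically by watching how W/Q grows with N. The sketch below is a rough heuristic, not a formal test: the growth functions stand in for the asymptotic orders given in this presentation, and the sample sizes and thresholds are arbitrary assumptions.

```python
import math

# Stand-ins for the asymptotic orders of W and Q_i for each kernel.
KERNELS = {
    "matrix multiply": (lambda n: 2 * n, lambda n: n),
    "FFT (radix-2)": (lambda n: n, lambda n: math.log2(n)),
    "CFAR": (lambda n: n, lambda n: 1),
}


def classify(W, Q, n_small=2**6, n_large=2**12):
    """Classify W/Q as 'constant', 'sub-linear', or 'linear' in N (heuristic)."""
    growth = (W(n_large) / Q(n_large)) / (W(n_small) / Q(n_small))
    size_growth = n_large / n_small
    if growth < 1.5:                 # ratio barely changes as N grows
        return "constant"
    if growth < 0.9 * size_growth:   # ratio grows, but more slowly than N
        return "sub-linear"
    return "linear"
```

Applied to the table above, matrix multiply is classified as constant, the radix-2 FFT as sub-linear, and CFAR as linear.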




Conclusions 


*  Stream  algorithms  map  efficiently  to  tiled  arrays 

-  Efficiency  can  approach  100%  as  data  size  and  array  size 
increase 

-  Implementations  on  RAW  simulator  show  the  efficiency  of 
this  approach 

-  Will  be  moving  implementations  from  simulator  to  board 

*  The  communication-to-computation  ratio  W/Q  gives  insight 
into  the  mapping  process 

-  A  constant  W/Q  seems  to  indicate  a  stream  algorithm  exists 

-  When  W/Q  is  greater  than  a  constant  it  is  hard  to  efficiently 
use  more  processors 

*  This  research  could  form  the  basis  for  a  methodology  of 
programming  tile  arrays 

-  More  research  and  formalism  required 